Goto

Collaborating Authors

 confidence calibration





Be Confident! Towards Trustworthy Graph Neural Networks via Confidence Calibration

Neural Information Processing Systems

Despite Graph Neural Networks (GNNs) have achieved remarkable accuracy, whether the results are trustworthy is still unexplored. Previous studies suggest that many modern neural networks are over-confident on the predictions, however, surprisingly, we discover that GNNs are primarily in the opposite direction, i.e., GNNs are under-confident. Therefore, the confidence calibration for GNNs is highly desired. In this paper, we propose a novel trustworthy GNN model by designing a topology-aware post-hoc calibration function. Specifically, we first verify that the confidence distribution in a graph has homophily property, and this finding inspires us to design a calibration GNN model (CaGCN) to learn the calibration function. CaGCN is able to obtain a unique transformation from logits of GNNs to the calibrated confidence for each node, meanwhile, such transformation is able to preserve the order between classes, satisfying the accuracy-preserving property. Moreover, we apply the calibration GNN to self-training framework, showing that more trustworthy pseudo labels can be obtained with the calibrated confidence and further improve the performance. Extensive experiments demonstrate the effectiveness of our proposed model in terms of both calibration and accuracy.


Thinking, Faithful and Stable: Mitigating Hallucinations in LLMs

Zou, Chelsea, Yao, Yiheng, Khalil, Basant

arXiv.org Artificial Intelligence

This project develops a self correcting framework for large language models (LLMs) that detects and mitigates hallucinations during multi-step reasoning. Rather than relying solely on final answer correctness, our approach leverages fine grained uncertainty signals: 1) self-assessed confidence alignment, and 2) token-level entropy spikes to detect unreliable and unfaithful reasoning in real time. We design a composite reward function that penalizes unjustified high confidence and entropy spikes, while encouraging stable and accurate reasoning trajectories. These signals guide a reinforcement learning (RL) policy that makes the model more introspective and shapes the model's generation behavior through confidence-aware reward feedback, improving not just outcome correctness but the coherence and faithfulness of their intermediate reasoning steps. Experiments show that our method improves both final answer accuracy and reasoning calibration, with ablations validating the individual contribution of each signal.


Evaluating Reasoning Faithfulness in Medical Vision-Language Models using Multimodal Perturbations

Moll, Johannes, Graf, Markus, Lemke, Tristan, Lenhart, Nicolas, Truhn, Daniel, Delbrouck, Jean-Benoit, Pan, Jiazhen, Rueckert, Daniel, Adams, Lisa C., Bressem, Keno K.

arXiv.org Artificial Intelligence

Vision-language models (VLMs) often produce chain-of-thought (CoT) explanations that sound plausible yet fail to reflect the underlying decision process, undermining trust in high-stakes clinical use. Existing evaluations rarely catch this misalignment, prioritizing answer accuracy or adherence to formats. We present a clinically grounded framework for chest X-ray visual question answering (VQA) that probes CoT faithfulness via controlled text and image modifications across three axes: clinical fidelity, causal attribution, and confidence calibration. In a reader study (n=4), evaluator-radiologist correlations fall within the observed inter-radiologist range for all axes, with strong alignment for attribution (Kendall's $τ_b=0.670$), moderate alignment for fidelity ($τ_b=0.387$), and weak alignment for confidence tone ($τ_b=0.091$), which we report with caution. Benchmarking six VLMs shows that answer accuracy and explanation quality can be decoupled, acknowledging injected cues does not ensure grounding, and text cues shift explanations more than visual cues. While some open-source models match final answer accuracy, proprietary models score higher on attribution (25.0% vs. 1.4%) and often on fidelity (36.1% vs. 31.7%), highlighting deployment risks and the need to evaluate beyond final answer accuracy.


Confidence Calibration in Large Language Model-Based Entity Matching

Kamsteeg, Iris, Cardenas-Cartagena, Juan, van Beers, Floris, Holt, Gineke ten, Tashu, Tsegaye Misikir, Valdenegro-Toro, Matias

arXiv.org Artificial Intelligence

This research aims to explore the intersection of Large Language Models and confidence calibration in Entity Matching. To this end, we perform an empirical study to compare baseline RoBERTa confidences for an Entity Matching task against confidences that are calibrated using Temperature Scaling, Monte Carlo Dropout and Ensembles. We use the Abt-Buy, DBLP-ACM, iTunes-Amazon and Company datasets. The findings indicate that the proposed modified RoBERTa model exhibits a slight overconfidence, with Expected Calibration Error scores ranging from 0.0043 to 0.0552 across datasets. We find that this overconfidence can be mitigated using Temperature Scaling, reducing Expected Calibration Error scores by up to 23.83%.


Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models

Kriz, Anita, Janes, Elizabeth Laura, Shen, Xing, Arbel, Tal

arXiv.org Artificial Intelligence

Multimodal large language models (MLLMs) hold considerable promise for applications in healthcare. However, their deployment in safety-critical settings is hindered by two key limitations: (i) sensitivity to prompt design, and (ii) a tendency to generate incorrect responses with high confidence. As clinicians may rely on a model's stated confidence to gauge the reliability of its predictions, it is especially important that when a model expresses high confidence, it is also highly accurate. We introduce Prompt4Trust, the first reinforcement learning (RL) framework for prompt augmentation targeting confidence calibration in MLLMs. A lightweight LLM is trained to produce context-aware auxiliary prompts that guide a downstream task MLLM to generate responses in which the expressed confidence more accurately reflects predictive accuracy. Unlike conventional calibration techniques, Prompt4Trust specifically prioritizes aspects of calibration most critical for safe and trustworthy clinical decision-making. Beyond improvements driven by this clinically motivated calibration objective, our proposed method also improves task accuracy, achieving state-of-the-art medical visual question answering (VQA) performance on the PMC-VQA benchmark, which is composed of multiple-choice questions spanning diverse medical imaging modalities. Moreover, our framework trained with a small downstream task MLLM showed promising zero-shot generalization to larger MLLMs in our experiments, suggesting the potential for scalable calibration without the associated computational costs. This work demonstrates the potential of automated yet human-aligned prompt engineering for improving the the trustworthiness of MLLMs in safety critical settings. Our codebase can be found at https://github.com/xingbpshen/prompt4trust.



Conf-Profile: A Confidence-Driven Reasoning Paradigm for Label-Free User Profiling

Li, Yingxin, Zhao, Jianbo, Ren, Xueyu, Tang, Jie, You, Wangjie, Chen, Xu, Zhou, Kan, Feng, Chao, Ran, Jiao, Meng, Yuan, Wang, Zhi

arXiv.org Artificial Intelligence

User profiling, as a core technique for user understanding, aims to infer structural attributes from user information. Large Language Models (LLMs) provide a promising avenue for user profiling, yet the progress is hindered by the lack of comprehensive benchmarks. To bridge this gap, we propose ProfileBench, an industrial benchmark derived from a real-world video platform, encompassing heterogeneous user data and a well-structured profiling taxonomy. However, the profiling task remains challenging due to the difficulty of collecting large-scale ground-truth labels, and the heterogeneous and noisy user information can compromise the reliability of LLMs. To approach label-free and reliable user profiling, we propose a Confidence-driven Profile reasoning framework Conf-Profile, featuring a two-stage paradigm. We first synthesize high-quality labels by leveraging advanced LLMs with confidence hints, followed by confidence-weighted voting for accuracy improvement and confidence calibration for a balanced distribution. The multiple profile results, rationales, and confidence scores are aggregated and distilled into a lightweight LLM. We further enhance the reasoning ability via confidence-guided unsupervised reinforcement learning, which exploits confidence for difficulty filtering, quasi-ground truth voting, and reward weighting. Experimental results demonstrate that Conf-Profile delivers substantial performance through the two-stage training, improving F1 by 13.97 on Qwen3-8B.